Introduce the broad concepts of modeling building, i.e. building candidate models and selecting the optimal model.
Check if data exists:
In [1]:
!ls -l ../datasets/FuelEconomy/
In this study, we are only interested in data from 2010 and 2011. Use the DataFrame data type from pandas to store the files. It is very similar to the statistical package R's data frame.
In [2]:
from __future__ import division
import numpy as np
import pandas as pd
cars10 = pd.read_csv("../datasets/FuelEconomy/cars2010.csv")
cars11 = pd.read_csv("../datasets/FuelEconomy/cars2011.csv")
In [3]:
cars10.head(5)
Out[3]:
Check if there is any missing values 'NAN' in this dataset.
In [4]:
cars10.count()
Out[4]:
In [5]:
print cars10.shape
print cars11.shape
We only restrict ourselves to a single predictor 'EngDispl' and the response 'FE' in this introductory illustration.
In [6]:
cars10_feature = cars10.get(['EngDispl'])
cars10_target = cars10.get(['FE'])
cars11_feature = cars11.get(['EngDispl'])
cars11_target = cars11.get(['FE'])
In [7]:
cars10_feature.head(5)
Out[7]:
In [8]:
cars10_target.head(5)
Out[8]:
Generally, we want to first visulize the datasets to get a better understanding before doing anything crazy. Since there is one predicator, a simple scatter plot would do the trick. The characteristics from the visulization may suggest important and necessary pre-processing steps.
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [10]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey = True)
ax1.scatter(cars10_feature, cars10_target)
ax1.set_title('2010 Model Year')
ax2.scatter(cars11_feature, cars11_target)
ax2.set_title('2011 Model Year')
fig.text(0.5, 0.04, 'Engine Displacement', ha='center', va='center')
fig.text(0.06, 0.5, 'Fuel Efficiency (MPG)', ha='center', va='center', rotation='vertical')
Out[10]:
Because of the nature of this problem, i.e. predict the MPG for a new car line, we take the 2010 data as training set and the 2011 data as test set.
In [11]:
# Define the evaluation metric: root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
def rmse(y_actual, y_predicted):
'''calculate Root Mean Squared Error'''
return np.sqrt(mean_squared_error(y_actual, y_predicted))
A good starting point is the simple linear model $$y = \beta_0 + \beta_1x,$$ where $y$ is the Fuel Efficiency (MPG) and $x$ is the Engine Displacement.
In [12]:
# simple linear model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(cars10_feature, cars10_target)
print "Least square estimate: intercept = {0}, coefficient ={1}".format(reg.intercept_, reg.coef_[0])
In [13]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = reg.predict(X)
cars10_target_pred = reg.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')
ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')
Out[13]:
The left-hand panel shows the training set data with a linear model fit defined by the estimated slope and intercept. The right-hand panel plots the observed and predicted MPG. These plots demonstrate that this model misses some of the patterns in the data, such as under-predicting fuel efficiency when the displacement is less than 2L or above 6L
In [14]:
# calculate root mean square error (RMSE)
from sklearn.cross_validation import cross_val_score
scores = np.sqrt(np.abs(cross_val_score(reg, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))
Notice that simply re-predict the training set data is like to result in overly optimistic estimation of RMSE. An alternative approach for quantifying how well the model operates is to use resampling techniques, e.g. 10-fold cross-validation. We will cover that in Chapter 4.
Looking at the previous figure, it is conceivable that the problem might be solved by introducing some non-linearity in the model. The most basic approach is to supplement the simple linear model with additional complexity, e.g. $$y = \beta_0 + \beta_1 x + \beta_2 x^2,$$ which is referred to as quadratic model.
In [15]:
# quadratic model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
quad = make_pipeline(PolynomialFeatures(2), LinearRegression())
quad.fit(cars10_feature, cars10_target)
scores = np.sqrt(np.abs(cross_val_score(quad, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))
Reduce in the RMSE may suggest that this model is a better fit to the data.
In [16]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = quad.predict(X)
cars10_target_pred = quad.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')
ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')
Out[16]:
One issue with quadratic models is that they can perform poorly on the extremes of the predictor. From the above figure, one might notice that predicting new vehicles with large displacement values may produce inaccurate results.
There are other approaches for creating sophisticated relationships between the predictors and outcome. One particular technique is the multivariate adaptive regression spline (MARS) model (Friedman (1991)). When used with a single predictor, MARS can fit separate linear regression lines for different ranges of engine displacement. This model, like many machine learning algorithms, has a tuning parameter which cannot be directly estimated from the data. While the MARS model has internal algorithms for making this dtermination, the user can try different values and use resampling to determin the appropriate value. Once the value is found, a final MARS model would be fit using all the training set data and used for prediction.
A Python module py-earth on Github implemented the MARS and is likely to be merged into sklearn in the near future (see this issue).
In [17]:
# MARS
from pyearth import Earth
mars = Earth()
mars.fit(cars10_feature, cars10_target)
scores = np.sqrt(np.abs(cross_val_score(mars, cars10_feature, cars10_target, cv=10, scoring='mean_squared_error')))
print "RMSE: {0}".format(np.mean(scores))
RMSE of MARS is similar to that of quadratic regression.
In [18]:
X = np.linspace(np.min(cars10_feature)[0], np.max(cars10_feature)[0])[:, np.newaxis]
y = mars.predict(X)
cars10_target_pred = mars.predict(cars10_feature)
y_range = np.linspace(np.min(cars10_target)[0], np.max(cars10_target)[0])[:, np.newaxis]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(cars10_feature, cars10_target)
ax1.plot(X, y, 'r')
ax1.set_title('2010 Model Year')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')
ax2.scatter(cars10_target, cars10_target_pred)
ax2.plot(y_range, y_range, 'r--')
ax2.set_xlabel('Observed')
ax2.set_ylabel('Predicted')
Out[18]:
Both quadratic model and MARS are evaluated on the test set.
In [19]:
X = np.linspace(np.min(cars11_feature)[0], np.max(cars11_feature)[0])[:, np.newaxis]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(cars11_feature, cars11_target)
ax1.plot(X, quad.predict(X), 'r')
ax1.set_xlabel('Engine Displacement')
ax1.set_ylabel('Fuel Efficiency (MPG)')
ax1.set_title('Quadratic model')
ax2.scatter(cars11_feature, cars11_target)
ax2.plot(X, mars.predict(X), 'r')
ax2.set_xlabel('Engine Displacement')
ax2.set_ylabel('Fuel Efficiency (MPG)')
ax2.set_title('MARS')
Out[19]:
In [20]:
# RMSE
quad_scores = np.sqrt(np.abs(cross_val_score(quad, cars11_feature, cars11_target, cv=10, scoring='mean_squared_error')))
mars_scores = np.sqrt(np.abs(cross_val_score(mars, cars11_feature, cars11_target, cv=10, scoring='mean_squared_error')))
print "Quadratic model RMSE: {0} and MARS RMSE: {1}".format(np.mean(quad_scores), np.mean(mars_scores))
The first thing to notice is that both scores are very similar, which indicates that either model is appropriate for this task. Also, both scores are lower than their previous values (fitted to 2010 data) as we would expect.
There are several aspects of the model building process that are worth discussing further.
Feature selection: the process of determining the minimum set of relevant predictors needed by the model.
"No Free Lunch" Theorem - Try a wide variety of techniques then determine which model to focus on.
Rely on cross-validation and the test set to produce quantitative assessments of the models to make the choice.
To get a reliable, trustworthy model for predicting new samples, we must first understand the data and the objective of the modeling.